In [27]:
%matplotlib inline
# RESEARCH IN PYTHON: Regression Discontinuity Analysis
# by J. NATHAN MATIAS March 23, 2015
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
This section is taken from Chapter 9 of Methods Matter by Richard Murnane and John Willett.
In Chapter 9, Murnane and Willett introduce the practice of Regression Discontinuity Analysis, a method for estimating a causal effect in cases where a randomized trial is not possible.
Regression Discontinuity analysis is possible in cases where some kind of cutoff determines who goes into one group versus another. Instead of looking at the effect of the predictor on the outcome for the entire population, we compare predicted outcomes on both sides of the cutoff.
In Methods Matter, Murnane and Willett use a paper by Joshua Angrist and Victor Lavy: "Using Maimonides' Rule to Estimate the Effect of Class Size on Scholastic Achievement." In this paper, Angrist and Lavy were not able to randomize class size assignment, but they were able to take advantage of a rule in Israeli schools that split classes into smaller sizes if the enrollment cohort was 41 or higher. This offered Angrist and Lavy a natural experiment for their research question, since we might expect cohorts just on either side of the cutoff to be very similar, except for which side of the cutoff they fall on (the equality of expectation assumption).
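To see why such a rule produces a sharp cutoff, here is a minimal sketch of the class size it implies for a given enrollment cohort. It assumes a maximum class size of 40 and that a cohort is split into the fewest equal-sized classes that respect that cap. The function name `predicted_class_size` is hypothetical and is not part of the dataset or of the analysis in this notebook.

```python
import math

MAX_CLASS_SIZE = 40  # assumed cap on class size under the rule

def predicted_class_size(enrollment, cap=MAX_CLASS_SIZE):
    """Average class size when a cohort is split into the fewest
    classes that keep every class at or below the cap."""
    n_classes = int(math.ceil(enrollment / float(cap)))
    return enrollment / float(n_classes)

# Class size drops sharply when the cohort grows from 40 to 41
for cohort in (39, 40, 41, 42, 80, 81):
    print("%d -> %.1f" % (cohort, predicted_class_size(cohort)))
```

Under these assumptions a cohort of 40 stays in one class of 40, while a cohort of 41 is split into two classes averaging 20.5 students. That drop is the discontinuity the analysis exploits.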
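Regression Discontinuity analysis uses observations further from the cutoff together with nearer ones "to project the estimated treatment effect at the cut-off." Practically, this technique regresses the outcome on two predictors: the forcing variable, centered on the cutoff, and a dichotomous cutoff predictor that indicates which side of the cutoff an observation falls on.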
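As a minimal sketch of that idea, the example below simulates data with a known jump at the cutoff and recovers it by ordinary least squares. Everything in it is synthetic and assumed for illustration: the variable names, the true jump of 5 points, the slope and the noise level are not taken from Angrist and Lavy's data.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed so the sketch is reproducible

# Synthetic forcing variable: cohort sizes from 29 to 53, centered on 41
size = rng.randint(29, 54, size=2000)
centered = (size - 41).astype(float)
treated = (size >= 41).astype(float)  # 1 at or above the cutoff, else 0

TRUE_JUMP = 5.0  # assumed treatment effect at the cutoff
noise = rng.normal(0, 2.0, size=2000)
outcome = 70.0 - 0.3 * centered + TRUE_JUMP * treated + noise

# Regress the outcome on an intercept, the centered forcing variable,
# and the cutoff indicator
X = np.column_stack([np.ones_like(centered), centered, treated])
coef = np.linalg.lstsq(X, outcome, rcond=None)[0]
intercept, slope, jump = coef
print("estimated jump at the cutoff: %.2f" % jump)
```

Observations far from the cutoff help pin down the slope, and the coefficient on the indicator is then the projected difference in the outcome exactly at the cutoff.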
The assumptions of this approach (beyond the assumptions of regression modeling) are:
In this analysis, we examine a dataset that includes school level data for:
In [28]:
# THINGS TO IMPORT
# This is a baseline set of libraries I import by default if I'm rushed for time.
import codecs # load UTF-8 Content
import json # load JSON files
import pandas as pd # Pandas handles dataframes
import numpy as np # Numpy handles lots of basic maths operations
import matplotlib.pyplot as plt # Matplotlib for plotting
import seaborn as sns # Seaborn for beautiful plots
from dateutil import * # I prefer dateutil for parsing dates
import math # transformations
import statsmodels.formula.api as smf # for doing statistical regression
import statsmodels.api as sm # access to the wider statsmodels library, including R datasets
from collections import Counter # Counter is useful for grouping and counting
import scipy
In [29]:
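import os.path
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2
# Download the dataset once; skip the download if the file is already cached.
if not os.path.isfile("class_size.dta"):
    response = urlopen("http://www.ats.ucla.edu/stat/stata/examples/methods_matter/chapter9/angrist.dta")
    if response.getcode() == 200:
        # .dta is a binary format, so write it in binary mode
        with open("class_size.dta", "wb") as f:
            f.write(response.read())
class_df = pd.read_stata("class_size.dta")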
In [30]:
print("==============================================================================")
print(" OVERALL SUMMARY")
print("==============================================================================")
print(class_df.describe())
The Forcing Variable (csize) is centered around the cutoff, and the cutoff predictor (small) is a dichotomous variable that indicates which side of the cutoff an observation lands on.
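Written out with the column names used in this notebook (size, csize, small, read), the model fitted below can be sketched as follows. The Greek-letter notation is shorthand assumed here for illustration, not notation taken from Murnane and Willett.

```latex
\begin{aligned}
\text{csize}_i &= \text{size}_i - 41, \qquad
\text{small}_i =
  \begin{cases}
    1 & \text{if } \text{size}_i \ge 41 \\
    0 & \text{otherwise}
  \end{cases} \\
\text{read}_i &= \beta_0 + \beta_1\,\text{csize}_i + \beta_2\,\text{small}_i + \varepsilon_i
\end{aligned}
```

Because csize is zero at the cutoff, the coefficient on small is the predicted difference in reading score between cohorts on either side of the cutoff, after accounting for the linear trend in cohort size.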
In [34]:
def small(size):
    # 1 if the enrollment cohort is at or above the cutoff of 41, else 0
    if(size>=41):
        return 1
    return 0
# "first" distinguishes the groups that participate in the first diff.
def first(group):
    groups = {1: 0, 2:0,
              3: 1, 4:1}
    return groups[group]
# SET UP Forcing Variable and Cutoff Predictor
class_df['small'] = class_df['size'].map(small)
class_df['csize'] = class_df['size'].map(lambda x: x-41)
# summarize the read variable by each class size group
class_df[(class_df['size']>=36) & (class_df['size']<=46)].boxplot("read", "csize")
plt.show()
In [36]:
# Keep cohorts within 12 students of the cutoff, sorted by the forcing
# variable so that the first l rows are the ones below the cutoff.
window = class_df[(class_df['size']>=29) & (class_df['size']<=53)].sort_values('csize')
result = smf.ols(formula = "read ~ csize + small",
                 data = window).fit()
print(result.summary())
plt.figure(num=None, figsize=(12, 6), dpi=80, facecolor='w', edgecolor='k')
plt.scatter(window.csize,window.read, color="blue")
# number of observations below the cutoff
l=window[window.csize<0].csize.count()
# plot the fitted line separately on each side of the cutoff
plt.plot(window.csize.iloc[0:l], result.predict()[0:l], '-', color="r")
plt.plot(window.csize.iloc[l:], result.predict()[l:], '-', color="r")
plt.axvline(x=-0.5,color="black", linestyle="--")
plt.title("Regression Discontinuity: Reading Scores by Class Size Before and After the Cutoff", fontsize=18)
Out[36]:
In this model, we see a statistically significant effect of class size on reading scores (p<0.05), as expressed in the coefficient on small and illustrated by plotting the model's predicted values against the forcing variable.
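The estimate above depends on the window chosen around the cutoff (cohorts of 29 to 53 students). One common check is to refit the model over narrower and wider windows and see whether the coefficient on the cutoff predictor stays stable. The sketch below illustrates the idea on synthetic data, not on the Angrist and Lavy dataset. The helper `rd_jump`, the simulated effect of 5 points and the list of window widths are all assumptions made for illustration.

```python
import numpy as np

def rd_jump(forcing, outcome, cutoff, half_width):
    """Estimate the jump at the cutoff by least squares, using only
    observations within half_width of the cutoff (hypothetical helper)."""
    forcing = np.asarray(forcing, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    keep = np.abs(forcing - cutoff) <= half_width
    centered = forcing[keep] - cutoff
    above = (centered >= 0).astype(float)
    X = np.column_stack([np.ones_like(centered), centered, above])
    coef = np.linalg.lstsq(X, outcome[keep], rcond=None)[0]
    return coef[2]  # coefficient on the cutoff indicator

# Synthetic data with a known jump of 5 points at a cutoff of 41
rng = np.random.RandomState(1)
size = rng.randint(21, 62, size=5000)
score = (70.0 - 0.3 * (size - 41) + 5.0 * (size >= 41)
         + rng.normal(0, 2.0, size=5000))

# Refit over several window half-widths around the cutoff
for half_width in (4, 8, 12, 20):
    estimate = rd_jump(size, score, 41, half_width)
    print("half-width %2d: estimated jump %.2f" % (half_width, estimate))
```

Narrow windows rely less on the assumed linear trend but use fewer observations, so their estimates are noisier. If the estimate changes sharply as the window widens, the linear specification is worth questioning.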
Murnane and Willett explore a more complex example in Ludwig and Miller (2007), "Does Head Start Improve Children's Life Chances? Evidence from a Regression Discontinuity Design." In this paper, the authors identify a possible exogenous cutoff in the deployment of Head Start: the US Office of Economic Opportunity (OEO) offered Head Start grant-writing support to the 300 poorest counties, those with county-level poverty rates above 59.2%. The authors fit two regression discontinuity models:
Model 1 "examines differences in counties' funding for and participation in, the Head Start program as a consequence of receipt of grant-writing assistance or the lack of it. The results of these analyses showed that grant-writing assistance did indeed lead to differences in Head Start funding and participation immediately following receipt of the assistance." (187)
Model 2 focused on "longer-term child health and schooling outcomes." In this model, "poverty-rate was again the forcing variable and exogenous assignment to grant-writing assistance again defined the experimental conditions. However, now, the impact of participating in Head Start on future child outcomes is the substantive focus, and so grant-writing assistance--which remains the principal question predictor--has become an exogenously assigned expression of intent to treat by Head Start" (187).
This opens up two causal questions:
Murnane and Willett walk readers through the various models fitted in this paper, as well as the sensitivity analyses conducted on the second set of models:
Murnane and Willett note that Regression Discontinuity can be convenient as an initial study design in politically complicated situations: "providing assistance to the 300 poorest counties, with the slightly less needy serving as controls, would be easier to defend ethically to members of Congress who were concerned about equity than would the assignment of poor counties randomly to treatment and control groups" (195). When doing this, there are several costs:
Murnane and Willett conclude by suggesting that readers further interested in this technique read Thomas Cook's 2008 paper, "Waiting for Life to Arrive: A History of the Regression-Discontinuity Design in Psychology, Statistics, and Economics," in the Journal of Econometrics.